Customer Segmentation¶
Data taken from Customer Segmentation on Kaggle.
Business Context¶
A credit card company wants to segment its customers based on their purchasing behavior at shopping malls, in order to identify the different types of cardholders and define tailored marketing strategies for each one. Below is a brief description of each attribute:
| Attribute | Description |
|---|---|
| CUST_ID | Credit card holder ID |
| BALANCE | Monthly average balance (based on daily balance averages) |
| BALANCE_FREQUENCY | Ratio of last 12 months with balance. (1: Frequently updated, 0: Not frequently updated) |
| PURCHASES | Total purchase amount spent during last 12 months |
| ONEOFF_PURCHASES | Total amount of one-off purchases |
| INSTALLMENTS_PURCHASES | Total amount of installment purchases |
| CASH_ADVANCE | Total cash-advance amount |
| PURCHASES_FREQUENCY | Frequency of purchases (Percent of months with at least one purchase). (1: Frequently purchased, 0: Not frequently purchased) |
| ONEOFF_PURCHASES_FREQUENCY | Frequency of one-off-purchases. (1: Frequently purchased, 0: Not frequently purchased) |
| PURCHASES_INSTALLMENTS_FREQUENCY | Frequency of installment purchases. (1: Frequently purchased, 0: Not frequently purchased) |
| CASH_ADVANCE_FREQUENCY | Cash-Advance frequency |
| CASH_ADVANCE_TRX | Number of cash-advance transactions |
| PURCHASES_TRX | Number of purchase transactions |
| CREDIT_LIMIT | Credit limit |
| PAYMENTS | Total payments (Due amount paid by the customer to decrease their statement balance) in the period |
| MINIMUM_PAYMENTS | Total minimum payments due in the period |
| PRC_FULL_PAYMENT | Percentage of months with full payment of the due statement balance |
| TENURE | Number of months as a customer |
The company has hired us to carry out this study, which we will complete over the first two weeks of the course. The tasks for each week are:
- EDA, data cleaning and preparation.
- Apply a clustering algorithm for customer segmentation to identify behavior patterns.
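As a preview of the second task, the segmentation step can be sketched with scikit-learn's `KMeans`. This is only a sketch on synthetic data (the library import and the toy features are assumptions; the notebook has not loaded scikit-learn or the real data yet):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled customer features:
# three well-separated groups of points in 2-D
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=-5.0, scale=0.5, size=(50, 2)),
])

# Fit k-means with k=3 and assign every point to a cluster
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)
print(len(set(labels)))  # number of distinct segments found
```

The same `fit_predict` call would apply to the cleaned, scaled customer table once the EDA below is done.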
# Import relevant libraries
import matplotlib.pyplot as plt
import pandas as pd
from ydata_profiling import ProfileReport
# plt.switch_backend('Qt5Agg')
%matplotlib inline
plt.isinteractive()
True
Let's get started by loading the CSV data and reviewing the column names and data types.
raw_data = pd.read_csv("./data/Customer_Data.csv")
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   CUST_ID                           8950 non-null   object 
 1   BALANCE                           8950 non-null   float64
 2   BALANCE_FREQUENCY                 8950 non-null   float64
 3   PURCHASES                         8950 non-null   float64
 4   ONEOFF_PURCHASES                  8950 non-null   float64
 5   INSTALLMENTS_PURCHASES            8950 non-null   float64
 6   CASH_ADVANCE                      8950 non-null   float64
 7   PURCHASES_FREQUENCY               8950 non-null   float64
 8   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
 9   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
 10  CASH_ADVANCE_FREQUENCY            8950 non-null   float64
 11  CASH_ADVANCE_TRX                  8950 non-null   int64  
 12  PURCHASES_TRX                     8950 non-null   int64  
 13  CREDIT_LIMIT                      8949 non-null   float64
 14  PAYMENTS                          8950 non-null   float64
 15  MINIMUM_PAYMENTS                  8637 non-null   float64
 16  PRC_FULL_PAYMENT                  8950 non-null   float64
 17  TENURE                            8950 non-null   int64  
dtypes: float64(14), int64(3), object(1)
memory usage: 1.2+ MB
raw_data.head()
| | CUST_ID | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C10001 | 5323.148883 | 0.250693 | 26344.072201 | 38237.442525 | 3727.113162 | 15708.239684 | 0.496536 | 0.464442 | 0.821611 | 0.194502 | 100 | 77 | 16819.480037 | 21337.027458 | 39452.958121 | 0.817907 | 7 |
| 1 | C10002 | 12726.638115 | 0.791307 | 37958.519019 | 5690.742440 | 18733.810964 | 38284.354433 | 0.699457 | 0.250327 | 0.654863 | 1.083902 | 78 | 156 | 15617.570575 | 8000.183624 | 63013.748477 | 0.343119 | 9 |
| 2 | C10003 | 4305.572068 | 0.176531 | 28392.953338 | 36009.470088 | 2873.383232 | 14294.185035 | 0.419764 | 0.523662 | 0.899912 | 0.207049 | 72 | 81 | 15515.586213 | 27111.360493 | NaN | 0.829074 | 6 |
| 3 | C10004 | 4740.988511 | 0.178076 | 27399.003842 | 38246.863491 | 3402.853375 | 6936.812518 | 0.439666 | 0.606597 | 0.783129 | 0.228299 | 78 | 83 | 12926.587974 | 23919.113404 | 38444.219979 | 0.883984 | 7 |
| 4 | C10005 | 13820.920640 | 0.826914 | 42214.021633 | 7341.007821 | 19273.070991 | 40091.347849 | 0.821412 | 0.283579 | 0.501361 | 1.106350 | 88 | 182 | 14404.705067 | 6994.688474 | 62041.617340 | 0.383186 | 10 |
Data Processing¶
Let's check for duplicates and missing values in the data.
raw_data.duplicated().sum()
np.int64(0)
raw_data.isna().sum()
CUST_ID                               0
BALANCE                               0
BALANCE_FREQUENCY                     0
PURCHASES                             0
ONEOFF_PURCHASES                      0
INSTALLMENTS_PURCHASES                0
CASH_ADVANCE                          0
PURCHASES_FREQUENCY                   0
ONEOFF_PURCHASES_FREQUENCY            0
PURCHASES_INSTALLMENTS_FREQUENCY      0
CASH_ADVANCE_FREQUENCY                0
CASH_ADVANCE_TRX                      0
PURCHASES_TRX                         0
CREDIT_LIMIT                          1
PAYMENTS                              0
MINIMUM_PAYMENTS                    313
PRC_FULL_PAYMENT                      0
TENURE                                0
dtype: int64
There are no duplicate entries, but two columns have missing values: CREDIT_LIMIT (1) and MINIMUM_PAYMENTS (313). We will handle these later.
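One common way to handle those missing values (a sketch only; the toy frame and median strategy are illustrative assumptions, not necessarily the approach adopted later) is per-column median imputation with pandas:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two affected columns
df = pd.DataFrame({
    "CREDIT_LIMIT": [1000.0, np.nan, 3000.0],
    "MINIMUM_PAYMENTS": [200.0, 150.0, np.nan],
})

# Fill each column's missing entries with that column's median,
# a robust choice for skewed monetary variables
for col in ["CREDIT_LIMIT", "MINIMUM_PAYMENTS"]:
    df[col] = df[col].fillna(df[col].median())

print(df.isna().sum().sum())  # 0 missing values remain
```

With only 1 and 313 missing entries out of 8950 rows, dropping those rows would also be defensible; imputation simply preserves the full sample.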
Exploratory Data Analysis (EDA)¶
raw_data.describe()
| | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8949.000000 | 8950.000000 | 8637.000000 | 8950.000000 | 8950.000000 |
| mean | 9382.387900 | 0.459205 | 30525.873439 | 22913.665720 | 11407.851805 | 22573.474202 | 0.559810 | 0.478032 | 0.623916 | 0.734946 | 76.169162 | 164.981453 | 14696.640361 | 19824.347218 | 44222.977795 | 0.545414 | 8.370615 |
| std | 5118.113559 | 0.282998 | 11041.410943 | 13573.919086 | 7087.102996 | 12594.295895 | 0.214055 | 0.221873 | 0.208207 | 0.422563 | 22.260935 | 71.322003 | 5537.074673 | 11069.268177 | 15468.314241 | 0.267443 | 1.756035 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 50.000000 | 0.000000 | 1006.064965 | 0.000000 | 6.000000 |
| 25% | 4275.036859 | 0.188490 | 27106.651600 | 6698.252490 | 3736.720194 | 10944.431111 | 0.407273 | 0.250856 | 0.518231 | 0.230716 | 65.000000 | 97.000000 | 13487.727313 | 9329.603116 | 31572.127683 | 0.336732 | 7.000000 |
| 50% | 12007.718812 | 0.306650 | 30934.582274 | 31011.033488 | 10832.141085 | 19465.792260 | 0.505305 | 0.533521 | 0.607608 | 0.947651 | 82.000000 | 176.000000 | 14942.545347 | 21367.621276 | 48780.271754 | 0.432690 | 8.000000 |
| 75% | 13740.565876 | 0.756061 | 39825.160798 | 34495.416423 | 19127.083036 | 36150.722564 | 0.777084 | 0.649120 | 0.814929 | 1.029172 | 92.000000 | 216.000000 | 16269.727124 | 25219.127375 | 57761.175985 | 0.815178 | 10.000000 |
| max | 19043.138560 | 1.000000 | 49039.570000 | 40761.250000 | 22500.000000 | 47137.211760 | 1.000000 | 1.000000 | 1.000000 | 1.500000 | 123.000000 | 358.000000 | 30000.000000 | 50721.483360 | 76406.207520 | 1.000000 | 12.000000 |
There are large scale differences among features; ranges vary from [0, 1] to roughly [0, 80,000].
- This needs to be handled, since we will use distance-based algorithms whose output degrades when feature scales vary significantly.
ProfileReport(raw_data)